This project is about the factors or features that could help the loan applicant to get the approval status.
This data set contains 113,937 loans with 81 variables on each loan, including loan amount, borrower rate (or interest rate), current loan status, borrower income, and many others.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from functools import reduce
from datetime import datetime
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
df = pd.read_csv('prosperLoanData.csv')
display(df.head())
| ListingKey | ListingNumber | ListingCreationDate | CreditGrade | Term | LoanStatus | ClosedDate | BorrowerAPR | BorrowerRate | LenderYield | ... | LP_ServiceFees | LP_CollectionFees | LP_GrossPrincipalLoss | LP_NetPrincipalLoss | LP_NonPrincipalRecoverypayments | PercentFunded | Recommendations | InvestmentFromFriendsCount | InvestmentFromFriendsAmount | Investors | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1021339766868145413AB3B | 193129 | 2007-08-26 19:09:29.263000000 | C | 36 | Completed | 2009-08-14 00:00:00 | 0.16516 | 0.1580 | 0.1380 | ... | -133.18 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 258 |
| 1 | 10273602499503308B223C1 | 1209647 | 2014-02-27 08:28:07.900000000 | NaN | 36 | Current | NaN | 0.12016 | 0.0920 | 0.0820 | ... | 0.00 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 1 |
| 2 | 0EE9337825851032864889A | 81716 | 2007-01-05 15:00:47.090000000 | HR | 36 | Completed | 2009-12-17 00:00:00 | 0.28269 | 0.2750 | 0.2400 | ... | -24.20 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 41 |
| 3 | 0EF5356002482715299901A | 658116 | 2012-10-22 11:02:35.010000000 | NaN | 36 | Current | NaN | 0.12528 | 0.0974 | 0.0874 | ... | -108.01 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 158 |
| 4 | 0F023589499656230C5E3E2 | 909464 | 2013-09-14 18:38:39.097000000 | NaN | 36 | Current | NaN | 0.24614 | 0.2085 | 0.1985 | ... | -60.27 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 20 |
5 rows × 81 columns
# Few variables are used for the project to simplify analysis:
cols = ['ListingKey',
'Term' ,
'LoanStatus',
'EstimatedEffectiveYield' ,
'BorrowerAPR' ,
'BorrowerRate' ,
'ProsperRating (numeric)',
'ProsperRating (Alpha)',
'ProsperScore',
'ListingCategory (numeric)',
'EmploymentStatus',
'Occupation',
'EmploymentStatusDuration',
'IsBorrowerHomeowner',
'IncomeVerifiable',
'StatedMonthlyIncome',
'MonthlyLoanPayment',
'Recommendations',
'DebtToIncomeRatio',
'LoanOriginalAmount' ,
'PercentFunded',
'IncomeRange',
'Investors',
'BorrowerState']
loan_data = df[cols]
display(loan_data.head())
| ListingKey | Term | LoanStatus | EstimatedEffectiveYield | BorrowerAPR | BorrowerRate | ProsperRating (numeric) | ProsperRating (Alpha) | ProsperScore | ListingCategory (numeric) | ... | IncomeVerifiable | StatedMonthlyIncome | MonthlyLoanPayment | Recommendations | DebtToIncomeRatio | LoanOriginalAmount | PercentFunded | IncomeRange | Investors | BorrowerState | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1021339766868145413AB3B | 36 | Completed | NaN | 0.16516 | 0.1580 | NaN | NaN | NaN | 0 | ... | True | 3083.333333 | 330.43 | 0 | 0.17 | 9425 | 1.0 | $25,000-49,999 | 258 | CO |
| 1 | 10273602499503308B223C1 | 36 | Current | 0.07960 | 0.12016 | 0.0920 | 6.0 | A | 7.0 | 2 | ... | True | 6125.000000 | 318.93 | 0 | 0.18 | 10000 | 1.0 | $50,000-74,999 | 1 | CO |
| 2 | 0EE9337825851032864889A | 36 | Completed | NaN | 0.28269 | 0.2750 | NaN | NaN | NaN | 0 | ... | True | 2083.333333 | 123.32 | 0 | 0.06 | 3001 | 1.0 | Not displayed | 41 | GA |
| 3 | 0EF5356002482715299901A | 36 | Current | 0.08490 | 0.12528 | 0.0974 | 6.0 | A | 9.0 | 16 | ... | True | 2875.000000 | 321.45 | 0 | 0.15 | 10000 | 1.0 | $25,000-49,999 | 158 | GA |
| 4 | 0F023589499656230C5E3E2 | 36 | Current | 0.18316 | 0.24614 | 0.2085 | 3.0 | D | 4.0 | 2 | ... | True | 9583.333333 | 563.97 | 0 | 0.26 | 15000 | 1.0 | $100,000+ | 20 | MN |
5 rows × 24 columns
display(loan_data.info())
display(loan_data.isnull().sum())
print('duplicate rows:',loan_data.duplicated().sum())
<class 'pandas.core.frame.DataFrame'> RangeIndex: 113937 entries, 0 to 113936 Data columns (total 24 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ListingKey 113937 non-null object 1 Term 113937 non-null int64 2 LoanStatus 113937 non-null object 3 EstimatedEffectiveYield 84853 non-null float64 4 BorrowerAPR 113912 non-null float64 5 BorrowerRate 113937 non-null float64 6 ProsperRating (numeric) 84853 non-null float64 7 ProsperRating (Alpha) 84853 non-null object 8 ProsperScore 84853 non-null float64 9 ListingCategory (numeric) 113937 non-null int64 10 EmploymentStatus 111682 non-null object 11 Occupation 110349 non-null object 12 EmploymentStatusDuration 106312 non-null float64 13 IsBorrowerHomeowner 113937 non-null bool 14 IncomeVerifiable 113937 non-null bool 15 StatedMonthlyIncome 113937 non-null float64 16 MonthlyLoanPayment 113937 non-null float64 17 Recommendations 113937 non-null int64 18 DebtToIncomeRatio 105383 non-null float64 19 LoanOriginalAmount 113937 non-null int64 20 PercentFunded 113937 non-null float64 21 IncomeRange 113937 non-null object 22 Investors 113937 non-null int64 23 BorrowerState 108422 non-null object dtypes: bool(2), float64(10), int64(5), object(7) memory usage: 19.3+ MB
None
ListingKey 0 Term 0 LoanStatus 0 EstimatedEffectiveYield 29084 BorrowerAPR 25 BorrowerRate 0 ProsperRating (numeric) 29084 ProsperRating (Alpha) 29084 ProsperScore 29084 ListingCategory (numeric) 0 EmploymentStatus 2255 Occupation 3588 EmploymentStatusDuration 7625 IsBorrowerHomeowner 0 IncomeVerifiable 0 StatedMonthlyIncome 0 MonthlyLoanPayment 0 Recommendations 0 DebtToIncomeRatio 8554 LoanOriginalAmount 0 PercentFunded 0 IncomeRange 0 Investors 0 BorrowerState 5515 dtype: int64
duplicate rows: 0
Conclusion:
In the loan_data dataset, there have 113937 rows and 24 columns.All the data types are correct.There have missing values in the EstimatedEffectiveYield,BorrowerAPR , ProsperRating (numeric),ProsperRating (Alpha) , ProsperScore , EmploymentStatus , Occupation ,EmploymentStatusDuration,DebtToIncomeRatio,BorrowerState columns.
#convert term into category
loan_data['Term'] = loan_data['Term'].astype('category')
loan_data = loan_data.dropna()
display(loan_data.info())
<class 'pandas.core.frame.DataFrame'> Int64Index: 76216 entries, 1 to 113936 Data columns (total 24 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ListingKey 76216 non-null object 1 Term 76216 non-null category 2 LoanStatus 76216 non-null object 3 EstimatedEffectiveYield 76216 non-null float64 4 BorrowerAPR 76216 non-null float64 5 BorrowerRate 76216 non-null float64 6 ProsperRating (numeric) 76216 non-null float64 7 ProsperRating (Alpha) 76216 non-null object 8 ProsperScore 76216 non-null float64 9 ListingCategory (numeric) 76216 non-null int64 10 EmploymentStatus 76216 non-null object 11 Occupation 76216 non-null object 12 EmploymentStatusDuration 76216 non-null float64 13 IsBorrowerHomeowner 76216 non-null bool 14 IncomeVerifiable 76216 non-null bool 15 StatedMonthlyIncome 76216 non-null float64 16 MonthlyLoanPayment 76216 non-null float64 17 Recommendations 76216 non-null int64 18 DebtToIncomeRatio 76216 non-null float64 19 LoanOriginalAmount 76216 non-null int64 20 PercentFunded 76216 non-null float64 21 IncomeRange 76216 non-null object 22 Investors 76216 non-null int64 23 BorrowerState 76216 non-null object dtypes: bool(2), category(1), float64(10), int64(4), object(7) memory usage: 13.0+ MB
<ipython-input-6-b7fb12510a03>:2: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
loan_data['Term'] = loan_data['Term'].astype('category')
None
#column label change
loan_data.rename(columns={'ProsperRating (Alpha)': 'ProsperRating_Alpha',
'ProsperRating (numeric)': 'ProsperRating_Numeric',
'ListingCategory (numeric)' : 'ListingCategory_Numeric'
}, inplace=True)
#making column names lower letter
loan_data.columns=loan_data.columns.str.lower()
display(loan_data.head(1))
| listingkey | term | loanstatus | estimatedeffectiveyield | borrowerapr | borrowerrate | prosperrating_numeric | prosperrating_alpha | prosperscore | listingcategory_numeric | ... | incomeverifiable | statedmonthlyincome | monthlyloanpayment | recommendations | debttoincomeratio | loanoriginalamount | percentfunded | incomerange | investors | borrowerstate | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 10273602499503308B223C1 | 36 | Current | 0.0796 | 0.12016 | 0.092 | 6.0 | A | 7.0 | 2 | ... | True | 6125.0 | 318.93 | 0 | 0.18 | 10000 | 1.0 | $50,000-74,999 | 1 | CO |
1 rows × 24 columns
display(loan_data['incomerange'].unique())
array(['$50,000-74,999', '$25,000-49,999', '$100,000+', '$75,000-99,999',
'$1-24,999', 'Not employed'], dtype=object)
#income_group function have been created to categorize the income level.
def income_group(row):
incomerange = row['incomerange']
if incomerange == "$1-24,999":
return 'low income '
if incomerange == "$25,000-49,999":
return "low middle income"
if incomerange == "$50,000-74,999":
return "middle middle income"
if incomerange == "$75,000-999,999":
return "upper middle income"
if incomerange == "$100,000+":
return "high income"
if incomerange == "Not employed":
return "job less"
loan_data['income label'] = loan_data.apply(income_group, axis=1 ).sort_values(ascending=False)
display(loan_data.head())
| listingkey | term | loanstatus | estimatedeffectiveyield | borrowerapr | borrowerrate | prosperrating_numeric | prosperrating_alpha | prosperscore | listingcategory_numeric | ... | statedmonthlyincome | monthlyloanpayment | recommendations | debttoincomeratio | loanoriginalamount | percentfunded | incomerange | investors | borrowerstate | income label | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 10273602499503308B223C1 | 36 | Current | 0.07960 | 0.12016 | 0.0920 | 6.0 | A | 7.0 | 2 | ... | 6125.000000 | 318.93 | 0 | 0.18 | 10000 | 1.0 | $50,000-74,999 | 1 | CO | middle middle income |
| 3 | 0EF5356002482715299901A | 36 | Current | 0.08490 | 0.12528 | 0.0974 | 6.0 | A | 9.0 | 16 | ... | 2875.000000 | 321.45 | 0 | 0.15 | 10000 | 1.0 | $25,000-49,999 | 158 | GA | low middle income |
| 4 | 0F023589499656230C5E3E2 | 36 | Current | 0.18316 | 0.24614 | 0.2085 | 3.0 | D | 4.0 | 2 | ... | 9583.333333 | 563.97 | 0 | 0.26 | 15000 | 1.0 | $100,000+ | 20 | MN | high income |
| 5 | 0F05359734824199381F61D | 60 | Current | 0.11567 | 0.15425 | 0.1314 | 5.0 | B | 10.0 | 1 | ... | 8333.333333 | 342.37 | 0 | 0.36 | 15000 | 1.0 | $100,000+ | 1 | NM | high income |
| 6 | 0F0A3576754255009D63151 | 36 | Current | 0.23820 | 0.31032 | 0.2712 | 2.0 | E | 2.0 | 1 | ... | 2083.333333 | 122.67 | 0 | 0.27 | 3000 | 1.0 | $25,000-49,999 | 1 | KS | low middle income |
5 rows × 25 columns
loan_status = loan_data.pivot_table(index = 'loanstatus', values = 'listingkey', aggfunc = 'nunique').reset_index()
loan_status.rename(columns = {'listingkey':'count'}, inplace = True)
loan_status.sort_values(by='count',ascending=False,inplace=True)
fig = px.bar(loan_status, x='loanstatus', y='count',
title='Loan status of all borrowers')
fig.show()
Conclusion:
LoanStatus of all Borrowers are with current and completed state.
employment_status = loan_data.pivot_table(index = 'employmentstatus', values = 'listingkey', aggfunc = 'nunique').reset_index()
employment_status.rename(columns = {'listingkey':'count'}, inplace = True)
employment_status.sort_values(by='count',ascending=False,inplace=True)
fig = px.bar(employment_status, x='employmentstatus', y='count',
title='Employment_status status of all borrowers')
fig.show()
Conclusion:
EmploymentStatus of all Borrowers are with Employed State and most of them are full time worker.
income_range = loan_data.pivot_table(index = 'income label', values = 'listingkey', aggfunc = 'nunique').reset_index()
income_range .rename(columns = {'listingkey':'count'}, inplace = True)
income_range.sort_values(by='count',ascending=False,inplace=True)
fig = px.bar(income_range , x= 'income label', y='count',
title='Income status of all borrowers')
fig.show()
Conclusion:
People having middle middle income(50,000-74,999 USD) and low middle income (25,000-49,999 USD) tool more loans.Job less and low income people have less chance to get loans from bank.
state_top10 = loan_data.pivot_table(index = 'borrowerstate', values = 'listingkey', aggfunc = 'nunique').reset_index()
state_top10 .rename(columns = {'listingkey':'count'}, inplace = True)
state_top10.sort_values(by='count',ascending=False,inplace=True)
top10_state = state_top10.head(10)
fig = px.bar(top10_state , x= 'borrowerstate', y='count',
title='Top 10 states of all Borrowers')
fig.show()
Conclusion:
Top 5 states of all Borrowers are from CA,NY,TX,FL and IL
ocuupation_top10 = loan_data.pivot_table(index = 'occupation', values = 'listingkey', aggfunc = 'nunique').reset_index()
ocuupation_top10.rename(columns = {'listingkey':'count'}, inplace = True)
ocuupation_top10.sort_values(by='count',ascending=False,inplace=True)
top10_occupation = ocuupation_top10.head(10)
fig = px.bar(top10_occupation, x= 'occupation', y='count',
title='Top 10 occupation of all Borrowers')
fig.show()
Conclusion:
Most of the borrowers occupation are not defined.May be self employed like property owner.But majority are with an occupation of Professional and Executive.
prosperscore = loan_data.pivot_table(index = 'prosperscore', values = 'listingkey', aggfunc = 'nunique').reset_index()
prosperscore .rename(columns = {'listingkey':'count'}, inplace = True)
prosperscore .sort_values(by='count',ascending=False,inplace=True)
fig = px.bar(prosperscore , x= 'prosperscore', y='count',
title='ProsperScore of all Borrowers')
fig.show()
Conclusion:
Majority of the borrowers are with a rating or score from 4 to 8.They have higher chance to approve loan.
The variables that are numeric are 'term', 'estimatedeffectiveyield', 'borrowerapr', 'borrowerrate', 'prosperrating_numeric', 'prosperscore', 'listingcategory_numeric', 'employmentstatusduration', 'statedmonthlyincome', 'monthlyloanpayment', 'recommendations', 'debtToIncomeratio', 'loanoriginalamount', 'percentfunded', 'investors'
cols_to_plot = loan_data[['term', 'estimatedeffectiveyield', 'borrowerapr', 'borrowerrate', 'prosperrating_numeric',
'prosperscore', 'listingcategory_numeric', 'employmentstatusduration', 'statedmonthlyincome',
'monthlyloanpayment', 'recommendations', 'debttoincomeratio', 'loanoriginalamount',
'percentfunded', 'investors']]
display(cols_to_plot.head())
| term | estimatedeffectiveyield | borrowerapr | borrowerrate | prosperrating_numeric | prosperscore | listingcategory_numeric | employmentstatusduration | statedmonthlyincome | monthlyloanpayment | recommendations | debttoincomeratio | loanoriginalamount | percentfunded | investors | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 36 | 0.07960 | 0.12016 | 0.0920 | 6.0 | 7.0 | 2 | 44.0 | 6125.000000 | 318.93 | 0 | 0.18 | 10000 | 1.0 | 1 |
| 3 | 36 | 0.08490 | 0.12528 | 0.0974 | 6.0 | 9.0 | 16 | 113.0 | 2875.000000 | 321.45 | 0 | 0.15 | 10000 | 1.0 | 158 |
| 4 | 36 | 0.18316 | 0.24614 | 0.2085 | 3.0 | 4.0 | 2 | 44.0 | 9583.333333 | 563.97 | 0 | 0.26 | 15000 | 1.0 | 20 |
| 5 | 60 | 0.11567 | 0.15425 | 0.1314 | 5.0 | 10.0 | 1 | 82.0 | 8333.333333 | 342.37 | 0 | 0.36 | 15000 | 1.0 | 1 |
| 6 | 36 | 0.23820 | 0.31032 | 0.2712 | 2.0 | 2.0 | 1 | 172.0 | 2083.333333 | 122.67 | 0 | 0.27 | 3000 | 1.0 | 1 |
sns.boxplot(x=loan_data['borrowerrate'])
<AxesSubplot:xlabel='borrowerrate'>
loan_data.hist('borrowerrate', bins = 10);
plt.xlabel('borrower interest rate');
plt.ylabel('Count');
plt.title("borrower interest rate of all Borrowers");
sns.boxplot(x=loan_data['statedmonthlyincome'])
<AxesSubplot:xlabel='statedmonthlyincome'>
loan_data_clean = loan_data.copy()
#Determine the upper and lower limits of outliers and remove the outliers(statedmonthlyincome) using IQR rule
uw_1 =(loan_data_clean['statedmonthlyincome'].quantile(0.75))+(1.5*((loan_data_clean['statedmonthlyincome'].quantile(0.75))-(loan_data_clean['statedmonthlyincome'].quantile(0.25))))
lw_1=(loan_data_clean['statedmonthlyincome'].quantile(0.25))-(1.5*((loan_data_clean['statedmonthlyincome'].quantile(0.75))-(loan_data_clean['statedmonthlyincome'].quantile(0.25))))
print(uw_1)
print(lw_1)
loan_data_clean = loan_data_clean.query(' @lw_1 <= statedmonthlyincome <= @uw_1 ')
display(loan_data_clean.head(2))
12750.0000005 -1916.6666675000001
| listingkey | term | loanstatus | estimatedeffectiveyield | borrowerapr | borrowerrate | prosperrating_numeric | prosperrating_alpha | prosperscore | listingcategory_numeric | ... | statedmonthlyincome | monthlyloanpayment | recommendations | debttoincomeratio | loanoriginalamount | percentfunded | incomerange | investors | borrowerstate | income label | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 10273602499503308B223C1 | 36 | Current | 0.0796 | 0.12016 | 0.0920 | 6.0 | A | 7.0 | 2 | ... | 6125.0 | 318.93 | 0 | 0.18 | 10000 | 1.0 | $50,000-74,999 | 1 | CO | middle middle income |
| 3 | 0EF5356002482715299901A | 36 | Current | 0.0849 | 0.12528 | 0.0974 | 6.0 | A | 9.0 | 16 | ... | 2875.0 | 321.45 | 0 | 0.15 | 10000 | 1.0 | $25,000-49,999 | 158 | GA | low middle income |
2 rows × 25 columns
sns.boxplot(x=loan_data_clean['statedmonthlyincome'])
<AxesSubplot:xlabel='statedmonthlyincome'>
loan_data_clean.hist('statedmonthlyincome', bins = 20);
plt.xlabel('StatedMonthlyIncome');
plt.ylabel('Count');
plt.title("StatedMonthlyIncome of all Borrowers");
sns.boxplot(x=loan_data['loanoriginalamount'])
<AxesSubplot:xlabel='loanoriginalamount'>
#Determine the upper and lower limits of outliers and remove the outliers(loanoriginalamount) using IQR rule
uw_2 =(loan_data_clean['loanoriginalamount'].quantile(0.75))+(1.5*((loan_data_clean['loanoriginalamount'].quantile(0.75))-(loan_data_clean['loanoriginalamount'].quantile(0.25))))
lw_2=(loan_data_clean['loanoriginalamount'].quantile(0.25))-(1.5*((loan_data_clean['loanoriginalamount'].quantile(0.75))-(loan_data_clean['loanoriginalamount'].quantile(0.25))))
print(uw_2)
print(lw_2)
loan_data_clean = loan_data_clean.query(' @lw_2 <= loanoriginalamount <= @uw_2 ')
display(loan_data_clean.head(2))
26500.0 -9500.0
| listingkey | term | loanstatus | estimatedeffectiveyield | borrowerapr | borrowerrate | prosperrating_numeric | prosperrating_alpha | prosperscore | listingcategory_numeric | ... | statedmonthlyincome | monthlyloanpayment | recommendations | debttoincomeratio | loanoriginalamount | percentfunded | incomerange | investors | borrowerstate | income label | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 10273602499503308B223C1 | 36 | Current | 0.0796 | 0.12016 | 0.0920 | 6.0 | A | 7.0 | 2 | ... | 6125.0 | 318.93 | 0 | 0.18 | 10000 | 1.0 | $50,000-74,999 | 1 | CO | middle middle income |
| 3 | 0EF5356002482715299901A | 36 | Current | 0.0849 | 0.12528 | 0.0974 | 6.0 | A | 9.0 | 16 | ... | 2875.0 | 321.45 | 0 | 0.15 | 10000 | 1.0 | $25,000-49,999 | 158 | GA | low middle income |
2 rows × 25 columns
sns.boxplot(x=loan_data_clean['loanoriginalamount'])
<AxesSubplot:xlabel='loanoriginalamount'>
loan_data_clean.hist('loanoriginalamount', bins = 20);
plt.xlabel('loan originalamount');
plt.ylabel('Count');
plt.title("loan original amount of all Borrowers");
sns.boxplot(x=loan_data['employmentstatusduration'])
<AxesSubplot:xlabel='employmentstatusduration'>
#Determine the upper and lower limits of outliers and remove the outliers(employmentstatusduration) using IQR rule
uw_3 =(loan_data_clean['employmentstatusduration'].quantile(0.75))+(1.5*((loan_data_clean['employmentstatusduration'].quantile(0.75))-(loan_data_clean['employmentstatusduration'].quantile(0.25))))
lw_3=(loan_data_clean['employmentstatusduration'].quantile(0.25))-(1.5*((loan_data_clean['employmentstatusduration'].quantile(0.75))-(loan_data_clean['employmentstatusduration'].quantile(0.25))))
print(uw_3)
print(lw_3)
loan_data_clean = loan_data_clean.query(' @lw_3 <=employmentstatusduration <= @uw_3')
display(loan_data_clean.head(2))
324.5 -143.5
| listingkey | term | loanstatus | estimatedeffectiveyield | borrowerapr | borrowerrate | prosperrating_numeric | prosperrating_alpha | prosperscore | listingcategory_numeric | ... | statedmonthlyincome | monthlyloanpayment | recommendations | debttoincomeratio | loanoriginalamount | percentfunded | incomerange | investors | borrowerstate | income label | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 10273602499503308B223C1 | 36 | Current | 0.0796 | 0.12016 | 0.0920 | 6.0 | A | 7.0 | 2 | ... | 6125.0 | 318.93 | 0 | 0.18 | 10000 | 1.0 | $50,000-74,999 | 1 | CO | middle middle income |
| 3 | 0EF5356002482715299901A | 36 | Current | 0.0849 | 0.12528 | 0.0974 | 6.0 | A | 9.0 | 16 | ... | 2875.0 | 321.45 | 0 | 0.15 | 10000 | 1.0 | $25,000-49,999 | 158 | GA | low middle income |
2 rows × 25 columns
sns.boxplot(x=loan_data_clean['employmentstatusduration'])
<AxesSubplot:xlabel='employmentstatusduration'>
loan_data_clean.hist('employmentstatusduration', bins = 30);
plt.xlabel('employment status duration');
plt.ylabel('Count');
plt.title("Employment status duration");
display(loan_data[['statedmonthlyincome','loanoriginalamount','employmentstatusduration']].describe())
| statedmonthlyincome | loanoriginalamount | employmentstatusduration | |
|---|---|---|---|
| count | 76216.000000 | 76216.000000 | 76216.000000 |
| mean | 6002.833586 | 9294.999410 | 104.578028 |
| std | 5110.743506 | 6413.404265 | 95.836443 |
| min | 0.250000 | 1000.000000 | 0.000000 |
| 25% | 3583.333333 | 4000.000000 | 32.000000 |
| 50% | 5041.666667 | 8000.000000 | 76.000000 |
| 75% | 7250.000000 | 14603.250000 | 150.000000 |
| max | 483333.333333 | 35000.000000 | 755.000000 |
display(loan_data[['borrowerrate']].describe())
| borrowerrate | |
|---|---|
| count | 76216.000000 |
| mean | 0.193621 |
| std | 0.074088 |
| min | 0.040000 |
| 25% | 0.134900 |
| 50% | 0.184500 |
| 75% | 0.254900 |
| max | 0.360000 |
loan_data_clean.dtypes
listingkey object term category loanstatus object estimatedeffectiveyield float64 borrowerapr float64 borrowerrate float64 prosperrating_numeric float64 prosperrating_alpha object prosperscore float64 listingcategory_numeric int64 employmentstatus object occupation object employmentstatusduration float64 isborrowerhomeowner bool incomeverifiable bool statedmonthlyincome float64 monthlyloanpayment float64 recommendations int64 debttoincomeratio float64 loanoriginalamount int64 percentfunded float64 incomerange object investors int64 borrowerstate object income label object dtype: object
numeric_vars = ['estimatedeffectiveyield', 'borrowerapr', 'borrowerrate',
'prosperrating_numeric', 'prosperscore',
'listingcategory_numeric','employmentstatusduration',
'statedmonthlyincome', 'monthlyloanpayment',
'recommendations', 'debttoincomeratio', 'loanoriginalamount','percentfunded', 'investors']
#numeric_vars_1 = ['BorrowerRate',
# 'ProsperScore',
#'StatedMonthlyIncome', 'MonthlyLoanPayment',
#'LoanOriginalAmount', 'Investors' ,'LoanOriginalAmount_log',
#'Investors_log']
categoric_vars = ['loanStatus', 'prosperrating_alpha', 'employmentstatus', 'occupation',
'incomerange', 'borrowerstate', 'term', 'isborrowerhomeowner']
# correlation plot
plt.figure(figsize = [12, 8])
sns.heatmap(loan_data_clean[numeric_vars].corr(), annot = True, fmt = '.3f',
cmap = 'vlag_r', center = 0)
plt.show()
*Conclusion:
# To identify how the applicants rating is affected by Employment status and Monthly income
plt.figure(figsize = [15, 7])
ax = sns.pointplot(data = loan_data_clean, x = 'employmentstatus', y = 'statedmonthlyincome', hue = 'prosperrating_numeric',
dodge = 0.3, linestyles = "")
plt.title('Applicants - Rating across Monthly Income and EmploymentStatus');
Conclusion:
plt.figure(figsize = [15, 7])
ax = sns.pointplot(data = loan_data_clean, x = 'prosperrating_numeric', y = 'borrowerrate', hue = 'isborrowerhomeowner',
dodge = 0.3, linestyles = "");
plt.title('Applicants - Home ownner status across ProsperRating and BorrowerRate ');
Conclusion:
We observe that without homeowner tend to have a higher interest rate, and thus lower rating.However homeowner tends to have lower interest rate and higher rating. So we can safely say that homeowner is safest bet when giving a loan.
To summarize this report, I believe that the loan approval status is heavily influenced by the applicant's details on income range, house owner status, and job status.